This tidy data set contains 1,599 red wines with 11 variables on the chemical
properties of the wine. At least 3 wine experts rated the quality of each wine,
providing a rating between 0 (very bad) and 10 (very excellent).
# Load the data, then drop the row-id column X
df <- read.csv('wineQualityReds.csv')
data_new <- subset(df, select = -c(X))
The dataset consists of 1599 observations of 13 variables.
Variable ‘X’ is the id given for each observation.
‘X’ and ‘quality’ are of ‘integer’ datatype; all other variables are of
datatype ‘numeric’.
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Quality scores range from 0 (very bad) to 10 (very excellent).
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or
nonvolatile (do not evaporate readily).
2 - volatile acidity: the amount of acetic acid in wine, which at
levels that are too high can lead to an unpleasant, vinegar taste.
3 - citric acid: found in small quantities, citric acid can
add ‘freshness’ and flavor to wines.
4 - residual sugar: the amount of sugar remaining after fermentation stops;
it is rare to find wines with less than 1 gram/liter, and wines with more
than 45 grams/liter are considered sweet.
5 - chlorides: the amount of salt in the wine.
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between
molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial
growth and the oxidation of wine.
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low
concentrations, SO2 is mostly undetectable in wine, but at free SO2
concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
8 - density: the density of wine is close to that of water, depending on
the percent alcohol and sugar content.
9 - pH: describes how acidic or basic a wine is on a scale from 0
(very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
10 - sulphates: a wine additive which can contribute to sulfur dioxide
gas (SO2) levels, which acts as an antimicrobial and antioxidant.
11 - alcohol: the percent alcohol content of the wine.
Reference : https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Let's plot all the variables in the data set. This will provide information
about the distribution of the various chemical properties of the red wines.
The above graphs show the distributions of the variables in the data set.
Volatile acidity, density and pH appear to be approximately normally
distributed. Some distributions, such as residual sugar and chlorides, are
long tailed.
To get a better understanding of the long tailed distributions, we can use a
log 10 transformation.
The log 10 transformation makes the residual sugar and chlorides distributions
roughly normal.
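The effect of a log 10 transformation can also be checked numerically. The sketch below is illustrative only: it uses simulated right-skewed values rather than the wine data, and `skewness()` is a hypothetical helper defined here, not a base R function.

```r
# Simulated long-tailed values standing in for a variable such as residual
# sugar (assumption: log-normal shape; these are NOT the real measurements).
set.seed(42)
x <- rlnorm(1000, meanlog = 0.8, sdlog = 0.5)

# Hypothetical helper: sample skewness (mean of the cubed z-scores).
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew_raw <- skewness(x)         # positive: long right tail
skew_log <- skewness(log10(x))  # closer to 0: roughly symmetric

print(c(raw = skew_raw, log10 = skew_log))

# With the real data the same idea would apply to a column, for example
# hist(log10(df$residual.sugar)), assuming df has been loaded as above.
```

A skewness near zero after the transformation is consistent with a roughly symmetric histogram, though it does not by itself establish normality.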
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Occurrences of the different quality levels in the dataset:
Quality: Count
3: 10
4: 53
5: 681
6: 638
7: 199
8: 18
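As a quick consistency check, the counts above can be turned back into a quality vector and tabulated. This is a minimal sketch: only the six counts come from the document, the ordering of the reconstructed vector is arbitrary, and the names `counts`, `quality_table` and `mid_share` are illustrative.

```r
# Rebuild a quality column from the counts listed above
# (only the counts are taken from the document; the row order is arbitrary).
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

quality_table <- table(quality)             # frequency of each quality level
quality_share <- prop.table(quality_table)  # the same, as proportions

print(quality_table)
print(round(quality_share, 3))

# Share of wines rated 5 or 6
mid_share <- sum(quality_share[c("5", "6")])
print(round(mid_share, 3))

# With the real data, table(df$quality) would give the same counts,
# assuming df has been loaded as above.
```

The reconstructed vector has 1599 entries and a mean that agrees with the 5.636 reported in the summary output.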
The dataset consists of 1599 observations of 13 variables (“X”,
“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”,
“chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”,
“sulphates”, “alcohol”, “quality”).
Variable ‘X’ is the id given to each observation.
Quality: the rating given to each wine, between 0 (very bad) and
10 (very excellent).
‘X’ and ‘quality’ are of ‘integer’ datatype; all other variables are of
datatype ‘numeric’.
From the histograms plotted, we can see that quality levels 5 and 6 occur most
often in the data set.
Volatile acidity, density and pH appear to be approximately normally
distributed. Some distributions, such as residual sugar and chlorides, are
long tailed.
Fixed acidity ranges from 4.6 to 15.9 g/dm^3 with a mean of 8.32. Volatile
acidity ranges from 0.12 to 1.58 g/dm^3 with a mean of 0.528. Citric acid
ranges from 0 to 1 g/dm^3 with a mean of 0.271. Residual sugar ranges from
0.9 to 15.5 g/dm^3 with a mean of 2.539. Chlorides range from 0.012 to
0.611 g/dm^3 with a mean of 0.087. Free sulfur dioxide ranges from 1 to
72 mg/dm^3 with a mean of 15.87. Total sulfur dioxide ranges from 6 to
289 mg/dm^3 with a mean of 46.47. Density ranges from 0.9901 to
1.0037 g/cm^3 with a mean of 0.9967. pH ranges from 2.74 to 4.01 with a
mean of 3.311. Sulphates range from 0.33 to 2 g/dm^3 with a mean of 0.6581.
Alcohol ranges from 8.4 to 14.9 % by volume with a mean of 10.42. Quality
ranges from 3 to 8 with a mean of 5.636.
The main feature of interest in this study is the quality of the red wine and
how it relates to the chemical properties.
Residual sugar and chlorides had long tailed distributions, so I scaled the x
axis using log 10. Also, the quality in the dataset only varies from 3 to 8.
Let's plot the quality of the red wine against all the other chemical
properties to find which of them are related to quality. I have chosen box
plots because they represent a categorical variable such as quality well.
The above graphs provide further insight into the red wine data set. Some
variables, such as residual sugar and chlorides, have a large number of
outliers, as we saw earlier. The box plots also confirm that most of the
samples in the data set have a quality of 5 or 6. Wines with higher alcohol
content tend to have better quality, and wines with lower volatile acidity
also tend to have better quality.
The box plots also show the range, the median and the mean of each chemical
property at each quality level.
Let's analyze this further using a correlation plot. This will give us a
better understanding of the data and show which variables have the strongest
relationships with the quality of red wine.
The above correlation plot shows the correlations among the variables in the
data set. The stronger the correlation, the darker the color.
alcohol : 0.476
volatile acidity : -0.390
sulphates : 0.251
citric acid : 0.226
After investigating the data set, four variables were found to have the
strongest correlations with quality: alcohol, volatile acidity, sulphates
and citric acid.
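Correlations like the ones listed above can be obtained with `cor()`. The following is a minimal sketch under stated assumptions: `quality_correlations()` is a hypothetical helper, and the data frame `toy` is simulated, so its numbers will not match the values reported for the wine data.

```r
# Hypothetical helper: Pearson correlation of every numeric column with
# `quality`, sorted by absolute strength.
quality_correlations <- function(d) {
  r <- cor(d)[, "quality"]
  r <- r[names(r) != "quality"]
  r[order(abs(r), decreasing = TRUE)]
}

# Simulated stand-in for the wine data (assumption: NOT the real measurements).
set.seed(1)
n <- 200
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
toy$quality <- round(5.6 + 0.4 * (toy$alcohol - 10.4) -
                       1.5 * (toy$volatile.acidity - 0.53) +
                       rnorm(n, sd = 0.6))

r <- quality_correlations(toy)
print(round(r, 3))

# With the real data this would be quality_correlations(data_new),
# assuming data_new has been created as above.
```

Sorting by absolute value keeps strong negative correlations, such as the one for volatile acidity, near the top of the list.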
Alcohol content has the highest correlation with red wine quality. The
graph below shows the relationship between quality and alcohol content.
We have chosen alpha = 1/4 and jitter to reduce overplotting.
Volatile acidity is negatively correlated with quality.
From the graphs we can see an approximately linear relationship between a few
variables and the quality of the red wine. Alcohol and volatile acidity stand
out, as they show the clearest linear relationships with quality.
The graphs show the factors most strongly associated with red wine quality.
Good quality red wine tends to have:
Higher alcohol content.
Lower volatile acidity.
Higher levels of sulphates and citric acid.
fixed acidity and pH : -0.683
fixed acidity and citric acid : 0.672
fixed acidity and density : 0.668
citric acid and pH : -0.541
citric acid and volatile acidity : -0.552
alcohol and density : -0.496
We use alpha = 1/4 to reduce overplotting.
The above graphs show an approximately linear relationship between these
pairs of chemical properties.
## quality fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 3 7.50 0.845 0.035 2.1
## 2 4 7.50 0.670 0.090 2.1
## 3 5 7.80 0.580 0.230 2.2
## 4 6 7.90 0.490 0.260 2.2
## 5 7 8.80 0.370 0.400 2.3
## 6 8 8.25 0.370 0.420 2.1
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 0.0905 6.0 15.0 0.997565 3.39
## 2 0.0800 11.0 26.0 0.996500 3.37
## 3 0.0810 15.0 47.0 0.997000 3.30
## 4 0.0780 14.0 35.0 0.996560 3.32
## 5 0.0730 11.0 27.0 0.995770 3.28
## 6 0.0705 7.5 21.5 0.994940 3.23
## sulphates alcohol
## 1 0.545 9.925
## 2 0.560 10.000
## 3 0.580 9.700
## 4 0.640 10.500
## 5 0.740 11.500
## 6 0.740 12.150
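The table above appears to hold one summary value per variable for each quality level, most likely the median. The call that produced it is not shown, so the following is only a sketch of how such a table could be built in base R with `aggregate()`; the data frame `toy` and its values are made up for illustration.

```r
# Tiny made-up data frame (assumption: illustrative values only).
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6),
  alcohol = c(9.4, 9.8, 9.6, 10.5, 11.0, 10.0),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.30, 3.39)
)

# Median of every other column within each quality level.
medians_by_quality <- aggregate(. ~ quality, data = toy, FUN = median)
print(medians_by_quality)

# With the real data this would be
# aggregate(. ~ quality, data = data_new, FUN = median),
# assuming data_new has been created as above.
```

Medians are less affected than means by the outliers seen in variables such as residual sugar and chlorides, which may be why a median summary is useful here.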
Here we plot together the variables that have a high correlation with quality.
We use a color scale in which low quality wines are plotted in light shades
and high quality wines in dark shades. We also plot regression lines to get a
better understanding of how the quality of red wines varies with these
chemical properties.
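The per-quality regression lines described above amount to fitting one straight line for each quality level. The sketch below shows that idea numerically in base R; it is not the plotting code used for the figures, and the data frame `toy` and the relationship built into it are simulated assumptions.

```r
# Simulated stand-in data (assumption: NOT the real measurements).
set.seed(7)
n <- 300
toy <- data.frame(
  quality = sample(5:7, n, replace = TRUE),
  alcohol = runif(n, min = 9, max = 13)
)
# Made-up relationship: volatile acidity falls slightly as alcohol rises
# and is lower for higher quality levels.
toy$volatile.acidity <- 0.9 - 0.03 * toy$alcohol -
  0.05 * (toy$quality - 5) + rnorm(n, sd = 0.05)

# Fit one straight line per quality level.
fits <- lapply(split(toy, toy$quality), function(g) {
  coef(lm(volatile.acidity ~ alcohol, data = g))
})

# One row per quality level; columns are the intercept and the slope.
coef_table <- do.call(rbind, fits)
print(round(coef_table, 3))
```

Comparing intercepts and slopes across rows is the numeric counterpart of comparing the light and dark lines in the plots.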
We have plotted graphs of alcohol, volatile acidity, density, fixed acidity
and citric acid against the quality of the red wine. In all the graphs, it is
clearly seen that wines with higher alcohol content and lower volatile acidity
tend to have better quality.
In the previous sections, we saw that citric acid and volatile acidity are
negatively correlated. This is reflected in the graphs: the higher the citric
acid and the lower the volatile acidity, the better the red wine quality tends
to be. The graphs also show that lower density and higher alcohol are
associated with better quality red wine.
The plot represents the variation of quality with alcohol and volatile acidity.
It shows that wines with lower volatile acidity and higher alcohol content
tend to have better quality; volatile acidity is negatively correlated with
quality. The lighter regression lines represent low quality wines, while the
darker lines represent high quality wines.
Red wine quality is most strongly correlated with alcohol, volatile acidity,
sulphates and citric acid, and red wines with high alcohol content tend to
show better quality. The correlations of these variables with quality are:
alcohol : 0.476
volatile acidity : -0.390
sulphates : 0.251
citric acid : 0.226
The plot represents the variation of quality with volatile acidity and citric
acid. The graph shows that volatile acidity and citric acid are negatively
correlated. Dark regression lines represent high quality wines, while light
regression lines represent low quality wines. Wines with a low concentration
of volatile acid and a high concentration of citric acid tend to be of better
quality.
The plot represents the variation of quality with sulphates and citric acid.
The graph shows that higher citric acid and sulphates are associated with high
quality wine. The dark regression lines represent high quality wines, while
the light regression lines represent low quality wines.
From the univariate plots, I learned how the various chemical properties in
the data set are distributed.
Some of the distributions were long tailed, notably residual sugar and
chlorides, so logarithmic transformations were used to reduce the effect of
outliers. Some of the graphs were overplotted, so I adjusted the alpha level
and the figure size.
In the bivariate plots, I used box plots and scatter plots, which gave me
good insight into the chemical properties that are most strongly related to
red wine quality.
The multivariate plots showed which pairs of chemical properties tend to go
with better quality red wine.
One thing I noted was that the number of wines with quality 5 and 6 was
much higher than the number at other quality levels. The study could be
improved with a larger number of samples.
Another thing I noted was that many samples of red wine contained no citric
acid.
We found correlations between many variables in the dataset.
The quality of red wine was most strongly correlated with alcohol, volatile
acidity, citric acid and sulphates. Higher alcohol content tends to go with
better quality. Volatile acidity was negatively related to quality: lower
volatile acidity and higher citric acid were associated with better quality
wine.
Higher sulphate content was also associated with high quality wine. To
summarise, higher alcohol, citric acid and sulphates and lower volatile
acidity tend to go with better quality wine.
We can also see notable correlations between other pairs of variables, such as
fixed acidity and pH, fixed acidity and citric acid, fixed acidity and
density, and citric acid and volatile acidity.
Even though quality is correlated with several other variables, correlation
does not imply causation. We cannot say that higher alcohol content produces
better wine. We can only conclude that the higher quality red wines in this
data set tend to contain more alcohol.
The analysis could be improved in future by increasing the number of red wine
samples. In this data set, wines of quality 5 and 6 occur far more often than
wines of other qualities. More data should be collected for very low quality
and very high quality wines, since there are very few samples with quality 3,
4 or 8.